1. INTRODUCTION

Data analytics is a current buzzword in the computer industry. With the immense development of digital
technology, the amount of digital data generated is increasing day by day. Easy access to devices like
smartphones and closed-circuit television (CCTV) cameras has contributed to a vast increase in image and
video data. Analyzing this data manually has become a tedious and time-consuming task. To tackle this
problem, various algorithms and methods have been proposed for automatic video and image analysis. This
area of research is known as intelligent video analysis and finds applications in intelligent
video surveillance, human-computer interaction, robotics, smart health care, and smart homes [1]. Human action
recognition (HAR) is an integral part of intelligent video analytics.

Herath et al. [2] define human action as follows: "Action is the most elementary human-surrounding
interaction with a meaning". Human actions are broadly divided into gestures, simple actions, interactions,
and group activities. Moving a palm or nodding the head is considered a gesture. A person walking,
jumping, or bending is considered a simple action. A handshake between two people or one person pushing another
is considered a human-human interaction. A person walking with a dog or picking up a bag is considered a
human-object interaction. More than two people talking or dancing is considered a group action [3].

Even after being researched for many years, human action recognition remains a challenging task
because of its vast scope. The main challenge is that there is no limit to the actions
that a human being can perform. Similar-looking actions like jogging, walking, and running can confuse
an automatic action recognition system. Another obstacle is the large diversity in the
way a particular action is performed, which gives rise to high intraclass variation. Other conditions
like varied camera angles, camera motion, scale changes, illumination changes, cluttered backgrounds, and
occlusion add to the challenges faced by an automatic human action recognition system.

Most of the work in this area uses the supervised learning approach. The main steps in a HAR
system are feature extraction, feature selection, and training a classifier with the extracted features [4] for
classification. The choice of features for representing the action depends on the type of action.
Algorithms proposed for gesture recognition, simple action recognition, and group action recognition differ
mainly in the selection of features. Various classifiers like the k-nearest neighbor, support vector machine,
and neural network have been explored for classification. It is observed that, along with the selection of
appropriate features, the selection of an appropriate classifier plays an important role in HAR performance. This
work focuses on the comparison of two neural network architectures, namely, the feed-forward neural network
(FFNN) and the cascade forward neural network (CFNN), for human action recognition. Two hand-crafted
features, namely, histogram of optical flow (HOF) and histogram of oriented gradients (HOG), are used for
representing the action. Experimentation is carried out on the well-known Weizmann, KTH, UT interaction, and
University of Central Florida (UCF) sports action datasets. Recognition accuracy is used as the performance
measure for comparing the architectures. The highest recognition accuracy of 97.59% is achieved for the UT 1
interaction dataset with the FFNN architecture. The accuracy may be improved further by using different
hand-crafted features.

Earlier work on human action recognition shows the use of various hand-crafted features.
Hand-crafted features are divided into two categories: local features and global features. A local feature describes the
object in parts, and these parts are then combined to form a local feature descriptor. A global feature describes an
object as a whole. Each type of feature has its advantages and disadvantages. Previous work in this domain has
emphasized the need for multiple features to describe an action. As one type of feature can capture only
one of the properties of a video, using multiple features helps describe an action more effectively, as shown in
[4]. Many researchers have combined local and global features to increase recognition accuracy. In many
approaches, the region of interest is detected before the actual features are extracted [5]-[8].

Bak et al. [5] have used a deep learning approach for salient region detection. Various fusion
mechanisms are explored for combining spatial and temporal features. Abdulmunem et al. [6] have
proposed the use of a support vector machine (SVM) classifier for classifying objects described by a
combination of global and local features. The 3D gradient location and orientation histogram (GLOH) vector,
which combines gradient locations and orientation histograms, is proposed by Abdulmunem et al. [7].
Duta et al. [8] propose new feature encoding methods, namely vector of locally aggregated descriptors (VLAD) and
spatio-temporal VLAD (ST_VLAD). The proposed methods give comparable
results on the datasets used for testing. A detailed study of the bag of visual words model using local features
applied to human action recognition is given in [9].

A new bag of visual words framework called hybrid super vector, which gives promising results, is also
proposed in [9]. A new feature descriptor using the fusion of stationary wavelet transform (SWT) and
local binary pattern (LBP) is proposed in [10]. A discrete wavelet transform (DWT) based method is
proposed in [11]. Four-level DWT is applied to find the features, and then stepwise linear discriminant
analysis is applied to find the key features to be used in training. The 3D stationary wavelet transform is used to
describe the action in [12], [13]. Accumulated motion image (AMI) and motion history image (MHI) are
introduced in [14]. DWT features are extracted from AMI and LBP features are extracted from MHI
images to form a feature descriptor. In Jyotsna et al. [15], HOG along with principal component analysis (PCA)
is used to describe the action after applying segmentation. A k-nearest neighbor
classifier is used, which gives a recognition accuracy of 94% on the Weizmann and 91.83% on the KTH dataset. HOG is
successfully used for face recognition in an intelligent surveillance system in [16]. A combination of HOG and
the local feature SWF [17] is seen to give high classification accuracy for the UT
interaction and UCF sports datasets.

Artificial neural networks (ANNs) have reshaped the domain of machine learning. ANNs are widely
used in human action recognition problems because of their capability to map complex inputs to outputs.
Several types of ANNs have been explored for classifying different types of data. Teixeira and Fernandes [18] have
compared the performance of the feed-forward neural network and the cascade forward neural network in the time series
domain to show the advantage of the cascade forward neural network. Dhanaseely [19] has presented results
obtained by the feed-forward neural network and the cascade forward neural network on a face recognition dataset
with principal component analysis used for feature extraction. In Badde et al. [20] and Goyal [21], feed-forward
backpropagation networks and cascade forward backpropagation networks are explored in the civil
engineering domain. The cascade forward network is shown to give better accuracy than the
feed-forward network.


2. PROPOSED METHOD 

This section describes the method proposed for human action recognition. The block
schematic of the proposed method is given in Figure 1. A test video is converted to frames and preprocessed
for de-noising. For human action recognition, spatial as well as temporal information is important. A
histogram of oriented gradients is used here to represent the spatial information of the action. A histogram of optical flow
represents the temporal or motion features of the action. Feature selection is done using principal component
analysis (PCA). PCA is applied to both features separately to reduce the dimensionality. The HOG and HOF features
are then concatenated to form the final feature descriptor. A neural network is trained with these feature
descriptors and used to classify the test video. The following subsections describe each step in detail.
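The fusion step described above (PCA on each feature type separately, followed by concatenation) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the function names `pca_reduce` and `fuse_features`, the power-iteration PCA, and the toy dimensions are all assumptions made for the example.

```python
import random


def pca_reduce(rows, k, iters=200, seed=0):
    """Project rows onto their top-k principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:  # no variance left in any direction
                v = [0.0] * d
                break
            v = [x / norm for x in w]
        # The Rayleigh quotient gives the eigenvalue; deflating the matrix
        # lets the next pass find the next component.
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        for i in range(d):
            for j in range(d):
                cov[i][j] -= lam * v[i] * v[j]
        components.append(v)
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


def fuse_features(hog_rows, hof_rows, k_hog, k_hof):
    """Reduce HOG and HOF separately, then concatenate per sample."""
    hog_red = pca_reduce(hog_rows, k_hog)
    hof_red = pca_reduce(hof_rows, k_hof)
    return [a + b for a, b in zip(hog_red, hof_red)]


# Toy data: 6 samples with 4-D "HOG" and 5-D "HOF" vectors (random placeholders).
rng = random.Random(1)
hog = [[rng.random() for _ in range(4)] for _ in range(6)]
hof = [[rng.random() for _ in range(5)] for _ in range(6)]
fused = fuse_features(hog, hof, k_hog=2, k_hof=3)
print(len(fused), len(fused[0]))  # 6 samples, 2 + 3 = 5 features each
```

In a real pipeline the projection would presumably be estimated on the training samples only and then reused for validation and test samples; the sketch fits it on whatever rows it is given.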


Figure 1. Block schematic for proposed human action recognition system


2.1. Histogram of oriented gradients (HOG)

The histogram of oriented gradients (HOG) describes an object in a frame using its spatial information.
HOG was first developed for detecting human figures in an image. Hence, HOG is a natural
choice for representing spatial features in the human action recognition task. The original algorithm [22],
which was developed for images, is applied to a video here by treating each frame as an image. Each
frame is divided into cells, which are small spatial regions, and the magnitude and orientation of the
gradient are calculated for each pixel. Each cell is then represented by a 1D histogram of gradient orientations, and these histograms together form the HOG feature.
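The per-cell histogram computation can be illustrated with a short sketch. It is a simplified version under stated assumptions: the cell size, the nine unsigned orientation bins, and the function name `hog_cell_histograms` are choices made for this example, and the block normalization step of the original algorithm [22] is omitted.

```python
import math


def hog_cell_histograms(frame, cell=4, bins=9):
    """Return concatenated per-cell orientation histograms for one frame.

    frame is a list of rows of grayscale intensities. Each pixel votes for
    its (unsigned) gradient orientation bin, weighted by gradient magnitude.
    """
    h, w = len(frame), len(frame[0])
    cells_y, cells_x = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cells_x)] for _ in range(cells_y)]
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            # Central differences, clamped at the frame border.
            gx = frame[y][min(x + 1, w - 1)] - frame[y][max(x - 1, 0)]
            gy = frame[min(y + 1, h - 1)][x] - frame[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(ang / (180.0 / bins)), bins - 1)
            hists[y // cell][x // cell][b] += mag
    return [v for row in hists for hist in row for v in hist]


# Toy 8x8 frame with a vertical edge: left half dark, right half bright.
frame = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
descriptor = hog_cell_histograms(frame)
print(len(descriptor))  # 2 x 2 cells x 9 bins = 36 values
```

For this toy frame all gradient energy falls into the first orientation bin, because the only intensity change is horizontal.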


2.2. Histogram of optical flow (HOF)

The histogram of optical flow has been shown to be capable of representing human body motion [23].
Optical flow is the pattern of apparent motion in an image sequence caused by relative motion
between the observer and the scene. It is obtained by finding the changes in the position of the object between two
consecutive frames. Here, HOF is represented by optical flow vectors calculated at each pixel. Using the feature
selection method, the optical flow vectors having maximum value are selected to form the feature descriptor.
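As an illustration of how per-pixel flow vectors can be turned into a histogram, the sketch below bins flow directions weighted by flow magnitude. It assumes that a dense flow field (horizontal component `u`, vertical component `v`) has already been computed by some optical flow method, and it shows a common way of forming HOF rather than the exact selection step used in this work; the function name and the eight direction bins are assumptions.

```python
import math


def hof_histogram(u, v, bins=8, min_mag=1e-6):
    """Histogram of flow directions over 0-360 degrees, magnitude-weighted.

    u and v are lists of rows holding the per-pixel flow components.
    The result is normalized to sum to 1 when any motion is present.
    """
    hist = [0.0] * bins
    for row_u, row_v in zip(u, v):
        for du, dv in zip(row_u, row_v):
            mag = math.hypot(du, dv)
            if mag < min_mag:  # ignore pixels with no apparent motion
                continue
            ang = math.degrees(math.atan2(dv, du)) % 360.0
            b = min(int(ang / (360.0 / bins)), bins - 1)
            hist[b] += mag
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Toy flow field: every pixel moves one unit to the right.
u = [[1.0] * 4 for _ in range(4)]
v = [[0.0] * 4 for _ in range(4)]
print(hof_histogram(u, v))
```

With purely rightward motion, all of the histogram mass lands in the first direction bin.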


2.3. Neural network

Two neural network architectures are used separately for evaluating the performance of the
system. The architectures of the feed-forward neural network (FFNN) and the cascade forward neural network
(CFNN) [24] are shown in Figures 2 and 3, respectively. FFNN is the most commonly used neural
network model and consists of an input layer, hidden layers, and an output layer. Figure 2 shows the general
architecture of FFNN. All the input nodes are connected to all the nodes in the first hidden layer, and all the
nodes of the last hidden layer are connected to the output layer. Data flows in only one direction,
i.e., from input to output. The backpropagation algorithm is used to learn the weights between layers. Multiple
layers of neurons and the backpropagation algorithm make it possible for the network to learn linear as well as
nonlinear relations between input and output.

The cascade forward neural network is similar to the FFNN but has an extra weighted connection
from the input layer to each hidden layer, in addition to the connection from each hidden layer to the successive layer. This extra
connection from the input to each layer makes the learning of the network faster. Figure 3 shows the architecture
of CFNN.

Figure 2. The architecture of neural network: FFNN

Figure 3. The architecture of neural network: CFNN


For a fair comparison of the performances, the same parameters are used for both neural network
architectures. The hyperbolic tangent sigmoid function is used as the activation function in the hidden layer
nodes. A linear function is used as the activation function for the output layer. The Levenberg-Marquardt
backpropagation algorithm, which is commonly used for supervised learning and is among the fastest training algorithms for networks of moderate size, is used for learning the
weights. Mean square error is used as the loss function.
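The structural difference between the two architectures, together with the activation and loss choices listed above, can be sketched as a forward pass in plain Python. This is an illustrative sketch only: the wiring of the cascade variant (every layer after the first also receives the raw input vector), the layer sizes, and all function names are assumptions, and training with Levenberg-Marquardt is not shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_in, hidden_sizes, n_out, cascade, seed=0):
    """Create layer parameters; with cascade=True each layer after the
    first has extra weights for the raw input (assumed CFNN wiring)."""
    rng = random.Random(seed)
    layers, prev = [], n_in
    for size in list(hidden_sizes) + [n_out]:
        extra = n_in if (cascade and layers) else 0
        layers.append(init_layer(prev + extra, size, rng))
        prev = size
    return layers


def forward(x, layers, cascade):
    """tanh on hidden layers, identity (linear) on the output layer."""
    out = list(x)
    for i, (weights, biases) in enumerate(layers):
        inputs = out + list(x) if (cascade and i > 0) else out
        pre = [sum(w * a for w, a in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
        is_output = i == len(layers) - 1
        out = pre if is_output else [math.tanh(p) for p in pre]
    return out


def mse(outputs, targets):
    """Mean squared error over all output units of all samples."""
    sq = [(o - t) ** 2
          for out, tgt in zip(outputs, targets) for o, t in zip(out, tgt)]
    return sum(sq) / len(sq)


def count_weights(layers):
    return sum(len(row) for weights, _ in layers for row in weights)


ffnn = build_network(4, [3, 3], 2, cascade=False)
cfnn = build_network(4, [3, 3], 2, cascade=True)
x = [0.2, -0.1, 0.4, 0.0]
print(count_weights(ffnn), count_weights(cfnn))  # 27 47
print(forward(x, ffnn, cascade=False))
print(forward(x, cfnn, cascade=True))
```

With identical layer sizes, the cascade variant has noticeably more trainable weights (47 against 27 in this toy configuration), which is one way to see why it has more capacity to fit the training data.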


2.4. Datasets

Publicly available datasets, namely, the Weizmann, KTH, UCF sports, and UT interaction action
datasets, are used for evaluating the performance of the FFNN and CFNN architectures. These datasets are selected
for evaluation because of their distinct properties. The Weizmann dataset contains videos of ten day-to-day actions
like walking, running, and jumping. These actions are performed by nine different actors.
The recording is done in a controlled environment where only one actor is present in a frame and the
background is uncluttered. In total, ninety videos are available in this dataset. The KTH
dataset is more complex than the Weizmann dataset.

In the KTH dataset, six simple actions like hand clapping, waving, and boxing are performed by 25
different actors. Each action is performed by every actor in four different scenarios. The scenarios used are
indoor, outdoor, change in scale, and change in the view angle. This dataset has 600 videos recorded in
a controlled environment. The UCF sports dataset also has one actor in every video, but the recording is
done at live sports events. There are videos of ten different sports like horse riding, golf, and diving. As
these videos are recorded at live events, they have varying backgrounds, varying view angles, illumination
changes, and different scales. This increases the complexity of this dataset. The UT interaction dataset differs
from the previously described datasets because it has two actors in every video. There are six actions, including
hugging, handshaking, punching, pushing, and kicking, performed by 10 different pairs. Only the action of
pointing a finger is performed by a single actor. This dataset is divided into two parts: UT interaction 1 (UT
1) and UT interaction 2 (UT 2). In the UT 1 dataset, actions are performed in a controlled environment.
In the UT 2 dataset, the same actions are performed with a cluttered background, partial occlusion, illumination
changes, and view angle changes. In a few videos of the UT 2 dataset, more than two actors are present in a
frame, making the recognition task more challenging. Figure 4 shows sample frames from all the datasets
used. Figure 4(a) shows a sample frame of the action class ‘walk’ from the Weizmann dataset. Figure 4(b) shows a
sample frame of the action class ‘walk’ from the KTH dataset. Figure 4(c) shows a sample frame from a video of the
action class ‘Swing bench’ from the UCF Sports dataset. Figure 4(d) shows a sample frame of the action class
‘shaking hands’ from the UT interaction dataset.


Figure 4. Sample frames from: (a) Weizmann, (b) KTH, (c) UCF Sports, and (d) UT interaction 1 datasets

For evaluating the performance, 80% of the samples are used for training, 10% for validation, and
10% for testing. Stratified sampling is used to keep the number of samples of each class proportional to the
number of samples of that class in the full dataset. Each setup is run 6 times with different samples
for training, validation, and testing, and the averages of accuracy, precision, and recall are calculated for both
neural network models.
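A stratified 80/10/10 split repeated over several runs can be sketched as below. The function name `stratified_split`, the use of a per-run random seed, and the toy label list are assumptions for illustration; the rounding rule for small classes is likewise one possible choice.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/validation/test per class.

    Each class is shuffled and divided separately so that class
    proportions are preserved in all three subsets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels: 3 classes with 20 samples each.
labels = [c for c in range(3) for _ in range(20)]
for run in range(6):  # six runs, each with a different random partition
    train, val, test = stratified_split(labels, seed=run)
print(len(train), len(val), len(test))  # 48 6 6
```

Metrics such as accuracy, precision, and recall would then be computed on the test indices of each run and averaged over the six runs.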


3. RESULTS AND DISCUSSION 

Extensive testing was done to evaluate the performance of the HOG+HOF descriptor for the human
action recognition system. To find the optimum number of hidden layers, experimentation was
performed on all datasets with different numbers of hidden layers, and the accuracy was calculated. Figure 5
shows the accuracy obtained plotted against the number of hidden layers. The depth of the neural
networks is increased by increasing the number of hidden layers from 5 to 100.

Figure 5. Effect of number of hidden layers on average accuracy


It is observed that recognition accuracy varies from 88% to 97% across datasets. Also,
recognition accuracy changes as the number of layers is changed. The time required for training the network
increases as the number of layers is increased. With 15 hidden layers, good accuracy is
achieved for all the datasets within a reasonable training time. For all further evaluations, 15 hidden layers are
used. Figure 6 shows the performance of the neural network models on the UT 2 interaction dataset.
Figures 6(a) and 6(b) show the validation performance of FFNN and CFNN, respectively, obtained for the UT
2 interaction dataset. It is seen that the mean square error (MSE) reduces with the number of epochs and, after some
epochs, is almost constant. For FFNN, MSE reduces at almost the same rate for
training, testing, and validation samples. After 10 epochs, MSE converges to the same value and remains
constant thereafter. On the other hand, for CFNN, MSE reduces fast for training samples because of the
connections from the input layer to the intermediate hidden layers. A low value of MSE is
achieved after only 2 epochs for training samples. For test and validation samples, MSE does not reduce
much and remains constant after only one epoch. This suggests that, because of the connections between the input
layer and every hidden layer, overfitting takes place, which results in high MSE for the validation and test data.

Figure 6. Sample validation performance obtained with: (a) FFNN and (b) CFNN on UT data set

The error histogram is another measure for evaluating the performance of a classification model.
The histogram of classification error plotted against the number of instances gives the distribution of classification error.
It is seen that most of the sample points from the training, testing, and validation data fall in the
zero-error bin for FFNN as well as CFNN. The spread of the error histogram is larger for FFNN than for CFNN.
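An error histogram of this kind can be computed as in the sketch below, where the error of each output is taken as target minus network output. The number of bins, the error range, and the function name are assumptions chosen so that a bin is centered on zero; the numbers in the usage example are made-up placeholders, not results from this work.

```python
def error_histogram(targets, outputs, bins=21, lo=-1.0, hi=1.0):
    """Count errors (target - output) in equal-width bins over [lo, hi].

    An odd number of bins places one bin symmetrically around zero.
    Errors outside the range are clamped into the outermost bins.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for t, o in zip(targets, outputs):
        err = min(max(t - o, lo), hi)
        b = min(int((err - lo) / width), bins - 1)
        counts[b] += 1
    return counts


# Placeholder values: three near-correct outputs and one clear miss.
targets = [1.0, 0.0, 1.0, 0.0]
outputs = [0.98, 0.03, 0.0, 0.0]
counts = error_histogram(targets, outputs)
print(counts[10], counts[20])  # 3 samples in the zero-centered bin, 1 in the last bin
```

A narrow histogram concentrated in the central bin indicates that most outputs are close to their targets, while mass in the outer bins corresponds to misclassified samples.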

The graph in Figure 7 shows the recognition accuracy obtained with the FFNN and CFNN architectures.
The accuracy obtained with both architectures is almost the same for the Weizmann dataset. For all the
remaining datasets, the accuracy obtained with FFNN is higher than that obtained with the CFNN architecture.

Figure 7. Comparison of recognition accuracy obtained with FFNN and CFNN architectures


Table 1 compares the recognition accuracy obtained for the UCF sports, UT interaction,
Weizmann, and KTH action datasets with state-of-the-art methods presented in the literature. For the UT
interaction dataset, the proposed method with FFNN architecture outperforms all other listed methods in
recognition accuracy (97.59%). For the UCF sports, Weizmann, and KTH datasets, the accuracy obtained
by the proposed method is comparable to that of the other methods; for UCF sports, the FFNN result of
90.35% is below the best listed result of 94% [34]. For every dataset, the FFNN architecture gives accuracy equal to or better than
CFNN. Because the input layer in CFNN is connected to more layers in the network, it tends to overfit, reducing
the overall recognition accuracy.


Table 1. Comparative results with state-of-the-art methods (recognition accuracy in %)

State-of-the-art methods   UCF sports   UT interaction   State-of-the-art methods   Weizmann   KTH
Carvajal et al. [25]       88.6         --               Chaaraoui et al. [26]      92.28      96.70
Yi and Lin [27]            90           91.8             Junejo et al. [28]         89         97.1
Wang and Qi [29]           92           83.3             Chivers [30]               97         96.9
Weng and Guan [31]         92.8         58.2             Siddiqi et al. [11]        81         80.33
Cho et al. [32]            89.7         85               Naveed et al. [33]         91.69      92.28
Nazir et al. [34]          94           --               Aslan et al. [35]          --         95.33
Ji et al. [36]             --           83.3             Zeng et al. [37]           98.7       --
This work with FFNN        90.35        97.59            This work with FFNN        93.16      94.08
This work with CFNN        88.85        95.2             This work with CFNN        93.16      93.28

4. CONCLUSION 

Experimental results show that, for human action recognition, FFNN gives higher accuracy than
CFNN. In the case of CFNN, the mean square error reduces fast for the training data but stabilizes at a higher level
for the test and validation data. This suggests that, in CFNN, overfitting occurs because of the
weighted connections between the input layer and all hidden layers. As a result, the recognition accuracy
achieved by CFNN is lower than that of FFNN.

In this paper, a fusion of HOG and HOF features is used to describe human actions. HOG and HOF
features are selected for this task as both are global features. As HOG and HOF features are
extracted from the frame as a whole, the crucial task of segmentation and foreground
extraction is not required. The combination of HOG, which gives spatial information, and HOF, which gives
motion information, forms a strong feature descriptor. The recognition accuracy will vary with the features
selected for representing the actions. As the focus of this work is on the comparison of neural network
architectures, other features are not explored. Comparison of the results obtained in this work with
previous state-of-the-art methods shows that for the Weizmann and KTH datasets, the recognition accuracy obtained is
comparable with other methods. For the UT interaction dataset, which is more complex, the
recognition accuracy outperforms the other methods; for the UCF sports dataset, it is comparable to, though
below, the best results listed in Table 1.